
gh-109593: ResourceTracker.ensure_running() calls finalizers #109620



Closed · wants to merge 1 commit from the mp_ensure_running branch

Conversation

@vstinner (Member) commented Sep 20, 2023

multiprocessing: Reduce the risk of reentrant calls to ResourceTracker.ensure_running() by explicitly running all finalizers before acquiring the ResourceTracker lock.
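
For readers not familiar with the internals, a minimal sketch of the approach described above; it is not the actual diff, and the lock and method body below are stand-ins for the real ResourceTracker.ensure_running() in Lib/multiprocessing/resource_tracker.py (util._run_finalizers() is a private multiprocessing helper):

# Sketch only, not the actual patch: run pending finalizers before taking
# the ResourceTracker lock, so a finalizer cannot re-enter ensure_running()
# (and deadlock) while the lock is already held.
import threading
from multiprocessing import util

class ResourceTracker:
    def __init__(self):
        self._lock = threading.Lock()

    def ensure_running(self):
        util._run_finalizers()   # run finalizers now, outside the lock
        with self._lock:
            pass  # stand-in for the real body (check/spawn the tracker process)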

@pitrou (Member) commented Sep 20, 2023

Hmm, but _run_finalizers should only be run at process end.

@vstinner (Member, Author)

Oh, test_asyncio.test_events fails with this change: specifically, test_get_event_loop_new_process().

@vstinner (Member, Author)

> Hmm, but _run_finalizers should only be run at process end.

What do you mean by "process end"?

@pitrou (Member) commented Sep 20, 2023

Sorry, I meant interpreter shutdown.

@vstinner (Member, Author)

> Hmm, but _run_finalizers should only be run at process end.

Oh, in fact what I want is to trigger a GC collection; I fixed my PR to do that.

I confirm that my change fixes the #109593 (comment) reproducer.

@pitrou: Would you mind reviewing this updated fix?

multiprocessing: Reduce the risk of reentrant calls to
ResourceTracker.ensure_running() by explicitly running a garbage
collection, to call pending finalizers, before acquiring the
ResourceTracker lock.
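
For comparison with the earlier sketch, the updated approach replaces the explicit finalizer run with a full garbage collection (again a sketch with a stand-in lock and body, not the real diff):

# Sketch only: trigger a collection so pending finalizers fire here,
# before the lock is acquired, instead of firing (and re-entering
# ensure_running()) while the lock is held.
import gc
import threading

class ResourceTracker:
    def __init__(self):
        self._lock = threading.Lock()

    def ensure_running(self):
        gc.collect()  # finalizers of collectable objects run here
        with self._lock:
            pass  # stand-in for the real body (check/spawn the tracker process)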
@pitrou (Member) commented Sep 20, 2023

Ok, this is better, but the problem is that gc.collect is slow, and ensure_running is called indirectly when instantiating a multiprocessing lock (see SemLock.__init__).

Instantiating a semaphore is currently 500x faster than a GC collection, and that's an optimistic measurement without a lot of objects allocated:

>>> %timeit gc.collect()
10.7 ms ± 352 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit mp.Semaphore()
21.7 µs ± 395 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
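
(For reference, a rough stdlib equivalent of the %timeit comparison above; this is only a sketch, and the absolute numbers depend heavily on the machine and on how many objects are alive:)

# Rough reproduction of the comparison with timeit instead of IPython.
import gc
import timeit
import multiprocessing as mp

n = 100
gc_ms = timeit.timeit(gc.collect, number=n) / n * 1e3
sem_us = timeit.timeit(mp.Semaphore, number=n) / n * 1e6
print(f"gc.collect():   {gc_ms:.2f} ms per call")
print(f"mp.Semaphore(): {sem_us:.2f} µs per call")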

@vstinner (Member, Author)

> Ok, this is better, but the problem is that gc.collect is slow, and ensure_running is called indirectly when instantiating a multiprocessing lock (see SemLock.__init__).

Do you want to propose a different fix?

My concern is that currently, multiprocessing can hang randomly. IMO it's bad and must be fixed. I prefer a slow multiprocessing to a multiprocessing that hangs randomly.

Also, I have limited interest in developing the most efficient fix. So if you have free cycles, please go ahead :-)

These days, I'm busy fixing tons of buildbot failures.

Obviously, I would be fine with a fast and correct fix for this issue :-)

@pitrou (Member) commented Sep 20, 2023

Well, TBH, I'm not sure this issue is very common, as I don't think I have ever seen it elsewhere.
It does deserve fixing, but it is not that urgent either (at least not for users of Python :-)).

But, yes, I'll try to come up with a fix.

@vstinner (Member, Author)

> Well, TBH, I'm not sure this issue is very common, as I don't think I have ever seen it elsewhere.

It makes more and more buildbots fail, so for me it's urgent.

What I meant in my previous message is that I'm considering fixing the issue right now; we can take time later to revisit it and find a better fix (with a lower impact on performance).

@pitrou (Member) commented Sep 20, 2023

This code isn't new, so it's surprising it's failing "more and more"?

@vstinner (Member, Author)

> This code isn't new, so it's surprising it's failing "more and more"?

I recently modified the CI to stop silently ignoring tests that fail randomly (FAILURE then SUCCESS), by passing the new --fail-rerun option to regrtest. It's likely that this hang was previously ignored silently by the CIs.

The affected buildbot, "PPC64LE Fedora Stable Refleaks 3.x", is blocking Python releases: it's part of the STABLE buildbots.

@pitrou (Member) commented Sep 20, 2023

Well... do you want to undo the change on the failing buildbot until we fix this issue?

@pitrou (Member) commented Sep 20, 2023

Please take a look at alternate PR #109629

@vstinner (Member, Author)

> do you want to undo the change on the failing buildbot until we fix this issue?

Oh sure, if I don't have the bandwidth to fix regressions, I will undo this change once we have listed the failing tests. That would be reasonable.

@vstinner vstinner marked this pull request as draft September 21, 2023 07:53
@vstinner (Member, Author)

Since @pitrou has a better approach, I'm converting this change to a draft for now.

@vstinner vstinner closed this Oct 3, 2023
@vstinner vstinner deleted the mp_ensure_running branch October 3, 2023 15:48
Labels: needs backport to 3.11, needs backport to 3.12